agent: add membw command for bare-metal DDR bandwidth test #100

Merged: widgetii merged 1 commit into master from agent-membw on May 14, 2026

@widgetii
Member

Summary

Closes #99.

  • New CMD_MEMBW (0x0D) / RSP_MEMBW (0x87) — runs memset / read-scan / memcpy kernels via ARM ldmia/stmia (r4-r11, 32 B per loop iter) against a scratch DDR buffer with the MMU cache on, timed with the ARMv7 PMU cycle counter (CCNT).
  • New defib agent membw [--size 4MB] [--iters 8] [--addr 0] [--port ...] [--output human|json] CLI command.
  • Reports cycles/byte (CPU-clock-invariant — the metric that actually isolates DDR fabric from CPU clock variance) plus MB/s when the architectural generic timer's CNTFRQ is set by an earlier boot stage. If CNTFRQ == 0 the host transparently falls back to cycles/byte.
  • Agent v4: bumps AGENT_VERSION, advertises CAP_MEMBW in INFO so the host can check support before sending the command.
  • ARMv7 (Cortex-A7 V4 / V5 / V6 family) only. ARMv5 (ARM926, hi3516cv300) cleanly rejects with ACK_FLASH_ERROR via #ifdef CPU_ARM926 — different PMU register layout, out of scope for the motivating use case.

Why

From the issue: when investigating an encoder fps gap between OpenIPC and vendor firmware on identical gk7205v300 silicon, the key question — "is the DDR fabric slow, or is Linux slow on top of it?" — can't be cleanly answered from inside Linux. CMA reservations, cache attributes, libc memcpy variance and scheduler noise all muddy any userspace number.

defib already runs a bare-metal agent in DDR right after SPL brings memory up. That's the exact moment we want to measure raw DDR throughput, before any kernel/ISP/VENC traffic. defib agent membw gives a reproducible apples-to-apples bandwidth number per firmware.

How

Agent C (agent/main.c, agent/protocol.h)

Three inline-asm kernels with ldmia/stmia over r4-r11 (8 words = 32 B per memory operation), so OpenIPC vs vendor builds produce identical instruction streams. Cache is on (write-back / write-allocate per startup.S page-table fill); the buffer is sized well above L1+L2 so DDR is the actual bottleneck.

CCNT is calibrated against CNTPCT (architectural generic timer, fixed frequency from CNTFRQ) over a 10 ms window. If CNTFRQ was never written by the bootrom — and on the V4 family it isn't — the agent returns timer_hz = 0 and the host falls back to the cycles/byte metric. That number alone already answers the original question because it normalises for CPU-clock differences across firmwares, which is the gotcha that bit the reporter in the original investigation.
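
As a rough illustration of the calibration arithmetic described above, the sketch below derives a CPU clock estimate from one CCNT/CNTPCT window and turns a cycle count into cycles/byte and MB/s. The function names, the 24 MHz timer and 900 MHz CPU figures, and the choice of 1 MB = 10^6 bytes are assumptions made for illustration, not the agent's or the host's actual code.

```python
from typing import Optional


def estimate_cpu_hz(ccnt_delta: int, cntpct_delta: int, timer_hz: int) -> Optional[float]:
    """Estimate the CPU clock from one calibration window.

    ccnt_delta   -- PMU cycle-counter ticks elapsed in the window
    cntpct_delta -- generic-timer ticks elapsed in the same window
    timer_hz     -- CNTFRQ value; 0 means no earlier boot stage set it
    """
    if timer_hz == 0 or cntpct_delta == 0:
        return None  # no wall-clock reference, so only cycles/byte is available
    return ccnt_delta * timer_hz / cntpct_delta


def cycles_per_byte(cycles: int, size_bytes: int, iters: int) -> float:
    """CPU cycles spent per byte touched across all iterations."""
    return cycles / (size_bytes * iters)


def mb_per_s(cpb: float, cpu_hz: Optional[float]) -> Optional[float]:
    """Convert cycles/byte to MB/s (assuming 1 MB = 10**6 bytes); None without a clock."""
    if cpu_hz is None or cpb <= 0:
        return None
    return cpu_hz / cpb / 1e6


# Hypothetical numbers: a 10 ms window on a 24 MHz timer with a 900 MHz core.
cpu_hz = estimate_cpu_hz(ccnt_delta=9_000_000, cntpct_delta=240_000, timer_hz=24_000_000)
cpb = cycles_per_byte(cycles=11_576_279, size_bytes=4 * 1024 * 1024, iters=8)
print(f"{cpb:.3f} cyc/B, MB/s with clock: {mb_per_s(cpb, cpu_hz)}")

# With CNTFRQ == 0 (the V4 case reported here) only cycles/byte survives.
print(mb_per_s(cpb, estimate_cpu_hz(9_000_000, 240_000, 0)))
```

Because cycles/byte divides out the core clock, two firmwares that run the CPU at different frequencies can still be compared directly, which is the property the fallback path relies on.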

Agent footprint guard

The default scratch sits at LOAD_ADDR + 8 MiB (a new AGENT_LOAD_ADDR macro is passed in via Makefile CFLAGS). handle_membw rejects any user-supplied addr whose [addr, addr + 2*size) range overlaps [LOAD_ADDR - 64 KB, LOAD_ADDR + 8 MiB] — otherwise an 8 MiB memcpy on the default V4 layout would stomp the running agent's own code. This was found during real-hardware testing — see the validation section.
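
The overlap test behind this guard can be sketched in a few lines. This is a Python model of the check, not the C code in handle_membw; the LOAD_ADDR value is hypothetical, and the reserved window is treated as half-open at its upper end so that the default scratch at LOAD_ADDR + 8 MiB is not rejected (an assumption about how the boundary is handled).

```python
KIB = 1024
MIB = 1024 * KIB

LOAD_ADDR = 0x4100_0000  # hypothetical load address, for illustration only


def scratch_overlaps_agent(addr: int, size: int, load_addr: int = LOAD_ADDR) -> bool:
    """Return True if [addr, addr + 2*size) intersects the agent's reserved window.

    The reserved window is modelled as [load_addr - 64 KiB, load_addr + 8 MiB).
    """
    scratch_lo = addr
    scratch_hi = addr + 2 * size  # the guarded range spans two buffers of `size`
    reserved_lo = load_addr - 64 * KIB
    reserved_hi = load_addr + 8 * MIB
    # Two half-open intervals intersect iff each starts before the other ends.
    return scratch_lo < reserved_hi and reserved_lo < scratch_hi


# Default scratch placement sits right after the reserved window: accepted.
print(scratch_overlaps_agent(LOAD_ADDR + 8 * MIB, 4 * MIB))
# A buffer placed on top of the agent itself: rejected.
print(scratch_overlaps_agent(LOAD_ADDR, 8 * MIB))
```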

Python host (src/defib/agent/client.py, cli/app.py)

  • MembwResult dataclass with cycles_per_byte(ticks, write_amp=1) and mbps(ticks, write_amp=1) helpers; mbps returns None when timer_hz == 0.
  • FlashAgentClient.membw(size_bytes, iters, addr) async method.
  • defib agent membw Typer command with human and json output modes.
  • agent info now lists membw in the capabilities line when reported by the agent.

Tests

  • Agent C (agent/test_agent.c): round-trip framing tests for the 12 B request and 32 B response packets.
  • Python (tests/test_agent_protocol.py::TestMembw): four tests using MockTransport — field parsing, MB/s + cycles/byte math, timer_hz == 0 graceful degradation, ARMv5 (ACK_FLASH_ERROR) rejection path.
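
For orientation, a round-trip check on the 12 B request payload could look like the sketch below. Only the command and response IDs and the 12-byte size come from this PR; the field order (size, iters, addr), the three 32-bit widths and the little-endian encoding are assumptions, and the helper names are hypothetical rather than the ones in client.py.

```python
import struct

CMD_MEMBW = 0x0D  # from the PR summary
RSP_MEMBW = 0x87  # from the PR summary

# Assumed layout: three little-endian uint32 fields = 12 bytes.
REQ_FMT = "<III"


def pack_membw_request(size_bytes: int, iters: int, addr: int) -> bytes:
    """Serialise the (assumed) request payload."""
    return struct.pack(REQ_FMT, size_bytes, iters, addr)


def unpack_membw_request(payload: bytes) -> tuple:
    """Parse the (assumed) request payload back into (size_bytes, iters, addr)."""
    if len(payload) != struct.calcsize(REQ_FMT):
        raise ValueError("membw request must be exactly 12 bytes")
    return struct.unpack(REQ_FMT, payload)


payload = pack_membw_request(4 * 1024 * 1024, 8, 0)
print(len(payload), unpack_membw_request(payload))
```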

Validation

Real hardware, 2026-05-14:

| Test | hi3516ev300 (V4) | gk7205v300 (V4) |
|---|---|---|
| memset 4 MiB × 8 | 0.345 cyc/B | 0.345 cyc/B |
| read 4 MiB × 8 | 0.512 cyc/B | 0.513 cyc/B |
| memcpy 4 MiB × 8 (R+W) | 0.446 cyc/B | 0.446 cyc/B |

Agent v4 advertises membw. 8 MiB × 16 and 16 MiB × 8: flat to 0.2% (past cache).

Both SoCs agree to 0.2% — expected, same V4 silicon family with the same DDR config. CNTFRQ == 0 on both, so MB/s shows n/a and the cycles/byte fallback activates automatically.
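
The 0.2% figure can be checked against the 4 MiB × 8 numbers reported above. The snippet is only a worked calculation over those values; the helper name and the choice of the larger value as the denominator are assumptions.

```python
# cycles/byte per kernel: (hi3516ev300, gk7205v300), from the validation run above
results = {
    "memset": (0.345, 0.345),
    "read": (0.512, 0.513),
    "memcpy": (0.446, 0.446),
}


def rel_diff(a: float, b: float) -> float:
    """Relative difference between two measurements, against the larger one."""
    return abs(a - b) / max(a, b)


for name, (ev300, gk7205) in results.items():
    print(f"{name}: {rel_diff(ev300, gk7205) * 100:.3f}%")

# The largest gap across kernels is the read scan.
worst = max(rel_diff(a, b) for a, b in results.values())
print(f"worst: {worst * 100:.3f}%")
```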

Tests / lint / cross-build (all green):

  • make -C agent test HOST_CC=gcc — 5412/5412 pass (includes 2 new framing tests)
  • uv run pytest tests/ -x --ignore=tests/fuzz — 494 pass, 2 skip (includes 4 new TestMembw tests)
  • uv run ruff check src/ tests/ — clean
  • uv run mypy src/defib/ --ignore-missing-imports — clean
  • Cross-build verified: gk7205v300, hi3516ev300, hi3516cv300 (ARMv5 reject path), hi3516cv610; make all-socs builds all four default targets.

Test plan

  • Agent C unit tests pass (make -C agent test HOST_CC=gcc)
  • Python tests pass (uv run pytest tests/)
  • Ruff + mypy clean
  • Cross-compile every default SoC
  • Real-hardware smoke on hi3516ev300 (ARMv7)
  • Real-hardware smoke on gk7205v300 (ARMv7, the motivating SoC)
  • Real-hardware edge cases: 4 MiB×8, 8 MiB×16, 16 MiB×8 — cycles/byte stable
  • Real-hardware smoke on hi3516cv300 (ARMv5 reject path — needs an ARMv5 board)
  • Run from OpenIPC U-Boot and vendor U-Boot on the same gk7205v300 silicon, diff cycles_per_byte — that's the motivating measurement the issue was asking for.

🤖 Generated with Claude Code

Adds CMD_MEMBW (0x0D) that runs memset / read-scan / memcpy via ARM
ldmia/stmia kernels (r4-r11, 32 B per loop iter) against a scratch DDR
buffer with the MMU cache on, timed using the ARMv7 PMU cycle counter
(CCNT). Reports cycles/byte — CPU-clock-invariant, the metric that
actually isolates DDR fabric — plus MB/s when the architectural
generic timer's CNTFRQ is set by an earlier boot stage.

Motivating use case (issue): comparing OpenIPC vs vendor U-Boot on
the same gk7205v300 silicon to determine whether an encoder fps gap
comes from DDR fabric or from the Linux software stack. cycles/byte
from membw answers that without any of the userspace cache-attr /
CMA / libc-memcpy confounders.

Bumps AGENT_VERSION to 4 and advertises CAP_MEMBW in INFO so the
host can check support. ARMv7 only (V4 / V5 / V6 family); ARMv5
(ARM926, hi3516cv300) cleanly rejects with ACK_FLASH_ERROR — different
PMU register layout, out of scope for the motivating case.

Default scratch is placed at LOAD_ADDR + 8 MiB (passed in via the new
AGENT_LOAD_ADDR macro from the Makefile) with a guard that rejects
any user-supplied addr where [addr, addr + 2*size) overlaps
[LOAD_ADDR - 64KB, LOAD_ADDR + 8 MiB] — otherwise an 8 MiB memcpy
on the default V4 layout would stomp the running agent's own code.

Validated end-to-end on hi3516ev300 + gk7205v300: cycles/byte stable
to 0.2% across 4 MiB×8, 8 MiB×16, 16 MiB×8 runs, matching across both
SoCs (same V4 silicon family). CNTFRQ is 0 on both — bootrom doesn't
initialise the generic timer — so the cycles/byte fallback path is the
one exercised in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

widgetii merged commit 5d90121 into master on May 14, 2026
13 checks passed
widgetii deleted the agent-membw branch on May 14, 2026 at 14:40

Development

Successfully merging this pull request may close these issues.

Add DDR bandwidth test to U-Boot agent (bare-metal memset/memcpy/read)